Abstract

Observational astronomy has changed drastically in the last decade: manually 
driven target-by-target instruments have been replaced by fully automated robotic 
telescopes. Data acquisition methods have advanced to the point that terabytes of 
data are flowing in and being stored on a daily basis. At the same time, the vast 
majority of analysis tools in stellar astrophysics still rely on manual expert interaction. 
To bridge this gap, we foresee that the next decade will witness a fundamental shift in
the approaches to data analysis: case-by-case methods will be replaced by fully automated
pipelines that will process the data from their reduction stage, through analysis,
to storage. While major effort has been invested in data reduction automation, automated
data analysis has mostly been neglected despite the urgent need. Scientific data
mining will face serious challenges to identify, understand and eliminate the sources of 
systematic errors that will arise from this automation. As a special case, we present 
an artificial intelligence (AI) driven pipeline that is prototyped in the domain of stellar 
astrophysics (eclipsing binaries in particular), current results and the challenges still 
ahead. 

Villanova University; Vanderbilt University; University of Hawaii; Eastern University;
Harvard-Smithsonian CfA; NOAO; University of Ljubljana; STScI; University of Florida;
University of Barcelona; INTA/CSIC.
1 Introduction



One of the most important changes in observational astronomy of the 21st century is a
rapid shift from classical object-by-object observations to extensive automated surveys.
As CCD detectors improve in sensitivity and their costs decrease, more and more small
and medium-size observatories are refocusing their attention¹ to the investigation of stellar
variability through systematic wide-field sky-scanning missions. This trend is additionally
powered by the success of pioneering surveys such as EROS (Palanque-Delabrouille et al.
1998), MACHO (Cook et al. 1995), OGLE (Udalski et al. 1997), ASAS (Pojmanski 2002),
their space counterpart Hipparcos (Perryman & ESA 1997) and others. Such surveys produce
massive amounts of data that pose a significant challenge to reduction and analysis.
Surveys and missions currently commissioned (e.g. Kepler (Borucki et al. 2007), LSST
(Tyson 2002), Pan-STARRS (Kaiser 2004) and Gaia (Perryman et al. 2001)) will produce
petabytes of data daily; spectroscopic surveys such as RAVE (Steinmetz et al. 2006), SEGUE
(Newberg & Sloan Digital Sky Survey Collaboration 2003) and Hermes (Raskin & Van Winckel
2008) will open the doors for complementary spectroscopy for millions of sources. Yet
currently-available tools fall short of the needs of proper analysis.

In this white paper we limit ourselves to stellar astrophysics (in particular, eclipsing
binary stars - EBs hereafter), but the points raised are readily applicable to other areas of
astronomy, such as the study of pulsating variable stars, asteroseismology, stellar rotation,
population theory, etc. To date, about a thousand papers have been published on EBs
with physical and geometrical parameters determined to better than 3% accuracy. For an
eclipsing binary expert it takes 1-2 weeks to reduce and analyze a single eclipsing binary
light curve the old-fashioned way. There are currently about 10,000 photometric/RV data-sets
that in principle allow modeling to a 3% accuracy. According to Hipparcos results, about
0.8% of the overall stellar population are EBs (917 out of 118,218 stars, Perryman & ESA
1997). Projecting these statistics to other large surveys allows estimating the number of
EBs expected in survey databases: ~136,000 in ASAS, ~56,000 in the OGLE LMC field,
~16,000 in the OGLE SMC field, 80,000 in TASS (Droege et al. 2006), etc. Gaia will
revolutionize these numbers, since the aimed census of the overall stellar population is ~1
billion down to +20 mag (Perryman et al. 2001), yielding millions of EBs and tens of
millions of variable stars. Finally, with LSST essentially complete to magnitude 24.5, the
yield of EBs will reach the tens of millions. Even if all observational facilities collapsed at
that point so that no further data got collected, it would still take 500 years for all 12,500
members of the IAU to analyze these data! Given the unique capability of EBs to yield
accurate stellar masses, radii, temperatures and distances, and realizing that many of these
are accessible by small-size ground instruments, EBs should definitely hold one of the top
positions on the list of observational candidates.
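The projection used above is simple arithmetic, and a minimal sketch may make it easier to reproduce. The function names below are illustrative and not part of any survey software; only the Hipparcos counts (917 EBs among 118,218 stars) come from the text, and applying a single EB fraction to every survey is a simplifying assumption.

```python
# Back-of-the-envelope sketch; function names are illustrative assumptions.

HIPPARCOS_EBS = 917        # EBs found by Hipparcos (quoted in the text)
HIPPARCOS_STARS = 118_218  # stars observed by Hipparcos (quoted in the text)


def eb_fraction(n_ebs=HIPPARCOS_EBS, n_stars=HIPPARCOS_STARS):
    """Fraction of stars that are eclipsing binaries."""
    return n_ebs / n_stars


def expected_ebs(n_survey_stars, fraction=None):
    """Project an EB fraction onto a survey with n_survey_stars sources."""
    if fraction is None:
        fraction = eb_fraction()
    return n_survey_stars * fraction


def implied_survey_size(n_ebs, fraction=None):
    """Invert the projection: number of stars implied by an EB count."""
    if fraction is None:
        fraction = eb_fraction()
    return n_ebs / fraction


if __name__ == "__main__":
    print(f"EB fraction: {100.0 * eb_fraction():.2f}%")
    print(f"EBs expected among 1e9 stars: {expected_ebs(1e9):.3g}")
```

Under these assumptions a census of about one billion stars yields several million EBs, in line with the "millions of EBs" quoted for Gaia.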



¹ A comprehensive list of more than a hundred such facilities may be found at
http://www.astro.physik.uni-goettingen.de/~hessman/MONET/links.html
Figure 1: ASAS project's automated data acquisition pipeline (Pojmanski 1997).



2 Brief review of process automation 

Data acquisition is the most automated aspect of the pipeline. An example of a fully
automatic data acquisition and analysis pipeline is that of the All-Sky Automated Survey
(ASAS, Pojmanski 1997), depicted in Figure 1. The level of sophistication is already such
that it assures accurate and reliable data from both ground-based and space surveys.

Variability classification, however, has proved to be much more involved than perhaps
initially expected. Fundamentally different objects (e.g. radial pulsators and ellipsoidal
variables) produce essentially indistinguishable light curves and follow-up spectroscopy is
paramount for identifying their true nature. A series of systematic analyses were conducted
by Rucinski (1997a,b, 1998) and later Maceroni & Rucinski (1999) and Rucinski & Maceroni
(2001) that highlighted the importance of the Fourier decomposition technique (FDT) for
classification of variable stars. The technique itself - fitting a 4th order Fourier series to
phased data curves and mapping different types of variables in Fourier coefficient space -
was originally proposed for EBs by Rucinski (1973) and has been used ever since, most
notably for classifying ASAS data (Pojmanski 2002; Paczyński et al. 2006). Alcock et al.
(1997), analyzing 611 bright EBs from the MACHO database (Cook et al. 1995), proposed
a new decimal classification scheme for categorizing EB types. Wyrzykowski et al. (2003,
2004) identified 2580 EBs in the LMC and 1351 EBs in the SMC. They employed a novel
classification approach using artificial neural networks (ANN) as an image recognition
algorithm, based on phased data curves that have been converted to low-resolution images as
depicted in Figure 2. Their classification pipeline was backed up by visual examinations of
results.

Figure 2: Conversion of phased light curves (left) to 70x15 pix images (right), which are fed
to the neural network image recognition algorithm. Taken from Wyrzykowski et al. (2003).
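To make the Fourier decomposition step concrete, the following is a minimal sketch of a least-squares fit of a 4th order Fourier series to a phased light curve. It is not the code used by any of the cited authors: the function names, the ordering of the coefficients and the use of normal equations solved by Gaussian elimination are all assumptions made for illustration.

```python
import math


def design_row(phase, order=4):
    """Basis values [1, cos(2*pi*k*phase), sin(2*pi*k*phase)] for k = 1..order."""
    row = [1.0]
    for k in range(1, order + 1):
        row.append(math.cos(2.0 * math.pi * k * phase))
        row.append(math.sin(2.0 * math.pi * k * phase))
    return row


def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def fit_fourier(phases, fluxes, order=4):
    """Least-squares coefficients [a0, a1, b1, ..., a_order, b_order]."""
    rows = [design_row(p, order) for p in phases]
    n = 2 * order + 1
    # Normal equations: (A^T A) coeffs = A^T y
    ata = [[sum(r[i] * r[j] for r in rows) for j in range(n)] for i in range(n)]
    aty = [sum(r[i] * y for r, y in zip(rows, fluxes)) for i in range(n)]
    return solve_linear(ata, aty)


if __name__ == "__main__":
    # Synthetic phased curve with known coefficients (illustrative only).
    phases = [i / 200.0 for i in range(200)]
    fluxes = [1.0 - 0.2 * math.cos(4.0 * math.pi * p) for p in phases]
    print([round(c, 6) for c in fit_fourier(phases, fluxes)])
```

The resulting coefficient vector is what would be placed in Fourier coefficient space for classification; which coefficients separate which variability classes is left to the cited papers.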



Approaches to automating light curve solutions have taken various forms to date. Wyithe &
Wilson (2001, 2002), in their work to establish the best distance indicators among detached
and semi-detached binaries in the Small Magellanic Cloud, obtained starting parameters for
the rigorous WD model by comparing each candidate light curve with a set of template
model light curves, sending the best match to an automated version of the differential
corrector program (DC). The latter could be computationally prohibitive to apply to the
expected large future data-sets. Employing less rigorous physical models, of course, is one
approach to computational efficiency. Thus, Tamuz et al. (2006) employ the EBOP
ellipsoidal model (Popper & Etzel 1981). Using this engine, they arrive at an initial solution
after a combination of grid search, gradient descent and geometrical analysis of the light
curve (LC). Devor (2005) illustrates an automated pipeline employing a simple model of
spherical stars without tidal or reflection physics, whose starting values are similarly
obtained. Prša et al. (2008) have devised a neural network based engine EBAI (Eclipsing
Binaries via Artificial Intelligence; http://www.eclipsingbinaries.org) that is capable of
processing thousands of light curves in just a few seconds; it yields principal parameters of
the analyzed variable.
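The template-matching idea described above (compare each candidate light curve with a set of model templates and pass the best match on as a starting point) can be sketched as a chi-square search. This is only an illustration: the box-shaped `toy_eclipse_curve` is a hypothetical stand-in for a physical model, and the parameter grid, function names and uniform error `sigma` are assumptions.

```python
def toy_eclipse_curve(depth, width, n_points=100):
    """Toy stand-in for a model light curve: flat flux with one box-shaped dip."""
    curve = []
    for i in range(n_points):
        phase = i / n_points
        in_eclipse = abs(phase - 0.5) < width / 2.0
        curve.append(1.0 - depth if in_eclipse else 1.0)
    return curve


def chi_square(observed, model, sigma=1.0):
    """Sum of squared, error-weighted residuals on a common phase grid."""
    return sum(((o - m) / sigma) ** 2 for o, m in zip(observed, model))


def best_template(observed, templates, sigma=1.0):
    """Return (parameters, chi2) of the template closest to the observed curve.

    templates maps a parameter tuple to a model curve on the same phase grid.
    """
    best_params, best_chi2 = None, float("inf")
    for params, model in templates.items():
        chi2 = chi_square(observed, model, sigma)
        if chi2 < best_chi2:
            best_params, best_chi2 = params, chi2
    return best_params, best_chi2


if __name__ == "__main__":
    grid = {(d, w): toy_eclipse_curve(d, w)
            for d in (0.1, 0.3, 0.5) for w in (0.05, 0.1, 0.2)}
    observed = [f + 0.01 for f in toy_eclipse_curve(0.3, 0.1)]
    print(best_template(observed, grid))
```

In a real pipeline the winning parameters would only seed a more rigorous solver; the cost of the search grows linearly with the number of templates, which is one reason cheaper models are attractive for large data-sets.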
3 Artificial Intelligence



Advances in Artificial Intelligence (AI) and the continued operation of Moore's law that
predicts exponential growth of the processing power have created the opportunity for significant
progress in solving the types of problems that are limited by the lack of human capital. A
new approach, the Intelligent Data Pipeline (IDP), has been prototyped in the domain of
EBs which uses AI techniques to operate autonomously on large observational data-sets to
produce results of astrophysical value. The IDP is designed to handle the complete process
of variable discovery, classification of variability and management of the solution process for
the discovered EBs (Devinney et al. 2005, 2006; cf. Fig. 3). The IDP employs ANNs in the
processing modules, while the supervisory knowledge, now implicit in humans, is encoded in
control modules as rules appropriate for each processing module. The supervisory modules 
have the task of keeping the process on track and providing physically meaningful results 
through each phase of the processing pipeline. 
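As a rough illustration of how supervisory rules might wrap a processing module, the sketch below pairs a processing step with a set of sanity checks on its output. It does not reproduce the IDP implementation: the rule set, the parameter names (`period`, `ecc`, `incl`) and the function names are hypothetical, chosen only because these constraints hold for any bound, eclipsing orbit.

```python
def check_solution(params):
    """Return the names of violated rules for an estimated EB solution."""
    rules = [
        ("period must be positive", lambda p: p["period"] > 0.0),
        ("eccentricity must lie in [0, 1)", lambda p: 0.0 <= p["ecc"] < 1.0),
        ("inclination must lie in [0, 90] deg", lambda p: 0.0 <= p["incl"] <= 90.0),
    ]
    return [name for name, ok in rules if not ok(params)]


def run_stage(process, supervise, data):
    """Run one processing module, then let its control module inspect the result."""
    result = process(data)
    violations = supervise(result)
    return result, violations


if __name__ == "__main__":
    # Hypothetical processing module that returns a fixed estimate.
    estimate = lambda light_curve: {"period": 2.5, "ecc": 1.2, "incl": 85.0}
    result, violations = run_stage(estimate, check_solution, light_curve=None) \
        if False else run_stage(estimate, check_solution, None)
    print(result, violations)
```

A solution with violations would be rejected or flagged for manual inspection rather than passed down the pipeline; how the actual control modules react is not specified here.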

ANNs are very simple algorithms that involve little beyond summation and multiplication,
yet they can be trained on physical content. While some may find the opaqueness of ANNs
problematic, their success in many areas, including classification and real-time robotic
control, is a powerful answer.
Figure 3: Intelligent Data Pipeline (IDP). Complete survey data is piped through a period
finder algorithm that is controlled by a rule-based system. All variable sources are then 
passed to the ANN-based classifier. Light curves consistent with EB signatures are passed 
to the Solution Estimator block. 
In their basic form, ANNs are systems of multiple layers (Fig. 4). Each layer consists
of a given number of independent units. Each unit holds a single value. These values are
propagated from each unit on the current layer to all units on the subsequent layer by
weighted connections. Propagation is a simple linear combination $y_i = \sum_j w_{ij} x_j$,
where $x_j$ are the values on the current layer, $w_{ij}$ are weighted connections, and $y_i$
are the values that enter the subsequent layer. Before they are stored in their respective
units, $y_i$ are first passed through a (non-linear) activation function $A_f$. This function,
typically a sigmoid function - $A_f(y_i) = 1/[1 + \exp(-(y_i - \mu)/\tau)]$ - introduces
non-linear properties to the network. Coefficients $\mu$ and $\tau$ are selected so that
$A_f(y_i)$ fall in the $(-1, 1)$ interval. It is this value that is stored in the $i$-th unit on
the subsequent layer. Layers in the three-layer network are usually denoted input, hidden,
and output layer. ANN is thus a non-linear mapping from the input layer to the output layer.
In our implementation in the domain of EBs, the ANN maps the input light curves to the
output set of principal physical parameters.
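The propagation rule and activation function given above translate almost line by line into code. The sketch below is a minimal three-layer forward pass under stated assumptions: weights are stored as one list per receiving unit, the defaults `mu = 0` and `tau = 1` are placeholders rather than values from the text, and the function names are illustrative. With the logistic form as written, the stored values lie between 0 and 1.

```python
import math


def sigmoid(y, mu=0.0, tau=1.0):
    """Activation A_f(y) = 1 / (1 + exp(-(y - mu) / tau))."""
    return 1.0 / (1.0 + math.exp(-(y - mu) / tau))


def propagate(values, weights, mu=0.0, tau=1.0):
    """One layer step: y_i = sum_j w_ij x_j, then store A_f(y_i) in unit i."""
    return [sigmoid(sum(w * x for w, x in zip(row, values)), mu, tau)
            for row in weights]


def forward(inputs, w_hidden, w_output):
    """Map the input layer to the output layer through one hidden layer."""
    hidden = propagate(inputs, w_hidden)
    return propagate(hidden, w_output)


if __name__ == "__main__":
    # Two inputs, two hidden units, one output; weights are arbitrary examples.
    print(forward([1.0, -1.0], [[2.0, 1.0], [-1.0, 3.0]], [[0.5, -0.5]]))
```

For an EB application the input list would hold the sampled light curve and the output list the (suitably rescaled) physical parameters.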

[Figure 5 panel axes: "Input values of parameters" (left) and "O-C differences" (right).]

Figure 5: ANN performance on a sample of 10,000 LCs. Left: comparison between the
input parameters (known from generating a sample) and output parameters provided by the
network. $T_{1,2}$ are effective temperatures of EB components, $\rho_{1,2}$ are their fractional
radii, $e$ is eccentricity, $\omega$ is argument of periastron, and $i$ denotes inclination.
Parameters are offset by 0.5 for clarity and a guideline is provided for easier comparison.
Right: distribution of residuals (main graph) and their cumulative distribution (inset). The
bars depict the fraction of EBs with errors between 0% and 2.5% (first bin), 2.5% and 5%
(second bin), etc. 90% of all LCs have errors smaller than 10% in all parameters.

Training the network implies determining the weights $w_{ij}$ on weighted connections. The
back-propagation algorithm relies on a sample of LCs (the training set) with known physical
parameters; these are called exemplars. All LCs are propagated through the network and
their outputs are compared to the known values. The weights are then modified so that the
discrepancy between the two sets is minimized. This is an iterative process that needs to
be done only once. After training, the network is ready to process any input LC extremely
quickly.
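The training loop described above can be sketched with plain gradient-descent back-propagation on the squared error. This is a textbook variant written for illustration, not the training code behind the results shown here: the learning rate, number of hidden units, weight initialization and the absence of bias terms (a constant input of 1.0 can play that role) are all assumptions.

```python
import math
import random


def sigmoid(y):
    return 1.0 / (1.0 + math.exp(-y))


def forward(x, w_hid, w_out):
    """Return (hidden values, output values) for one input vector."""
    hidden = [sigmoid(sum(w * v for w, v in zip(row, x))) for row in w_hid]
    output = [sigmoid(sum(w * h for w, h in zip(row, hidden))) for row in w_out]
    return hidden, output


def train(exemplars, n_hidden=4, rate=0.5, epochs=2000, seed=0):
    """Back-propagation on (inputs, targets) pairs with targets scaled to (0, 1).

    Returns (w_hid, w_out, history); history holds the summed squared error
    accumulated over each pass through the training set.
    """
    rng = random.Random(seed)
    n_in, n_out = len(exemplars[0][0]), len(exemplars[0][1])
    w_hid = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_hidden)]
    w_out = [[rng.uniform(-0.5, 0.5) for _ in range(n_hidden)] for _ in range(n_out)]
    history = []
    for _ in range(epochs):
        total = 0.0
        for x, target in exemplars:
            hidden, output = forward(x, w_hid, w_out)
            # Output-layer error signal: (o - t) * o * (1 - o)
            d_out = [(o - t) * o * (1.0 - o) for o, t in zip(output, target)]
            # Hidden-layer error signal, propagated back through w_out
            d_hid = [h * (1.0 - h) * sum(d_out[k] * w_out[k][j] for k in range(n_out))
                     for j, h in enumerate(hidden)]
            for k in range(n_out):
                for j in range(n_hidden):
                    w_out[k][j] -= rate * d_out[k] * hidden[j]
            for j in range(n_hidden):
                for i in range(n_in):
                    w_hid[j][i] -= rate * d_hid[j] * x[i]
            total += sum((o - t) ** 2 for o, t in zip(output, target))
        history.append(total)
    return w_hid, w_out, history


if __name__ == "__main__":
    # Toy exemplars: second input is a constant 1.0 acting as a bias.
    data = [([v, 1.0], [0.2 + 0.6 * v]) for v in (0.0, 0.25, 0.5, 0.75, 1.0)]
    _, _, hist = train(data, epochs=500)
    print(hist[0], hist[-1])
```

Once the weights are fixed, processing a new light curve costs a single call to `forward`, which is why the trained network is fast.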

To evaluate ANN performance, we created a set of 10,000 synthetic light curves for 
eclipsing binary stars and passed it through a trained ANN. To each LC we added variable 
amounts of white noise, simulating different S/N ratios. Fig. 5 depicts the results that show
clear statistical viability: 90% of the sample resulted in parameters with errors less than 
10%. The success rate of recognition is comparable to that of the learning sample, and the 
underlying distribution of errors for both data-sets is indistinguishable. This demonstrates 
the capability of the ANN to successfully recognize data it has never seen before. 
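The kind of bookkeeping behind such an evaluation can be sketched as follows. The error definition (absolute difference as a percentage of an assumed parameter span), the 2.5% bin width taken from the Figure 5 caption, and all function names are assumptions for illustration; the sketch does not reproduce the synthetic light curves or the trained network.

```python
import random


def add_white_noise(curve, sigma, rng):
    """Add Gaussian noise of standard deviation sigma to every flux point."""
    return [f + rng.gauss(0.0, sigma) for f in curve]


def percent_errors(true_values, recovered_values, span):
    """Absolute errors as a percentage of the parameter's allowed span."""
    return [100.0 * abs(r - t) / span for t, r in zip(true_values, recovered_values)]


def fraction_within(errors, limit=10.0):
    """Fraction of the sample with errors below limit (in percent)."""
    return sum(1 for e in errors if e < limit) / len(errors)


def error_histogram(errors, bin_width=2.5, n_bins=8):
    """Fraction of the sample per error bin; the last bin collects the overflow."""
    counts = [0] * n_bins
    for e in errors:
        counts[min(int(e // bin_width), n_bins - 1)] += 1
    return [c / len(errors) for c in counts]


if __name__ == "__main__":
    rng = random.Random(42)
    noisy = add_white_noise([1.0] * 10, 0.01, rng)  # one noisy toy light curve
    errs = percent_errors([0.50, 0.20, 0.80], [0.52, 0.35, 0.79], span=1.0)
    print(fraction_within(errs), error_histogram(errs))
```

Running the same statistics on both the training sample and an unseen sample is one way to check that their error distributions agree.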
4 Discussion

The importance of results that will be achieved by developing novel fully automated
approaches can hardly be overstated. In the domain of EBs, their analysis yields:

• calibration-free physical properties of stars (i.e. masses, radii, surface temperatures,
luminosities);

• accurate stellar distances; 

• precise stellar ages; 

• stringent tests of stellar evolution models. 

The products of state-of-the-art EB modeling are seminal to many areas of astrophysics: 

• calibrating the cosmic distance scale; 

• mapping of clusters and other stellar populations (e.g. star-forming regions, streams, 
tidal tails, etc) in the Milky Way; 

• determining initial mass functions and studying stellar population theory; 

• understanding stellar energy transfer mechanisms (including activity) as a function of 
temperature, metallicity and evolutionary stage; 

• calibrating stellar color-temperature transformations, mass-radius-luminosity
relationships, and other relations basic to a broad array of stellar astrophysics;

• studying stellar dynamics, tidal interactions, mass transfer, accretion, chromospheric 
activity, etc. 

In addition, spectroscopic surveys such as RAVE, SEGUE and Hermes will provide
observations of thousands of spectroscopic binaries that will allow the determination of
metallicity, leading to chemical tagging, galactic stratigraphy and population memberships.

The enormous inflow of data that marked the previous decade exposed the deficiencies of
the analysis tools in this decade. Manual analysis will have to be limited to the
astrophysically most interesting cases; all other sources will need to be processed in a fully
automated fashion. In this white paper we presented one possible approach to automation -
artificial neural networks - that has a unique capability of processing hundreds of thousands
of LCs in a matter of minutes. Suitable training data-sets will have to be created that would
allow for a wide range of light curve types to be automatically processed. The community
will need to invest significant effort to further develop automation methods and update the
current tools to cope with this challenge. In addition, greater attention needs to be paid to
intelligent components, such as expert systems, to ensure appropriate flow down the data
pipeline.
5 Recommendations



Our recommendations to the Decadal Survey 2010 regarding actions that need 
to be taken in order to address the challenges pointed out above are: 

• form a dedicated center (such as MAST or HEASARC) that would address
the issues of data analysis automation; such a center would employ 3-5 FTE
in software engineering and theoretical scientific modeling;

• form a narrowly focused IAU commission that would steer community
efforts - e.g. the National Virtual Observatory (NVO) interface, the choice
of a programming language (e.g. Python), specifications for application
deployment, a well-defined suite of test cases, etc;

• organize regular workshops and splinter meetings at the AAS and IAU
symposia to address the application of Artificial Intelligence and other fully
automated methods in astronomy data mining;

• secure adequate funding through NSF/NASA for technology research and 
implementation through specialized calls for proposals. 